Batch Reinforcement Learning from Crowds
A shortcoming of batch reinforcement learning is its requirement for rewards in the data, which makes it inapplicable to tasks without reward functions. Existing settings for the lack of reward, such as behavioral cloning, rely on optimal demonstrations collected from humans. Unfortunately, extensive expertise is required to ensure optimality, which hinders the acquisition of large-scale data for complex tasks. This paper addresses the lack of reward in a batch reinforcement learning setting by learning a reward function from preferences. Generating preferences requires only a basic understanding of a task and, being a mental process, is faster than performing demonstrations, so preferences can be collected at scale from non-expert humans using crowdsourcing. This paper tackles a critical challenge that emerges when collecting data from non-expert humans: the noise in preferences. A novel probabilistic model is proposed for modelling the reliability of labels, which utilizes labels collaboratively. Moreover, the proposed model smooths the estimation with a learned reward function. Evaluation on Atari datasets demonstrates the effectiveness of the proposed model, followed by an ablation study that analyzes the relative importance of the proposed ideas.
Comment: 16 pages. Accepted by ECML-PKDD 202
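For context, the standard way to learn a reward function from pairwise preferences is the Bradley-Terry formulation, which work in this area commonly builds on. The sketch below is a minimal, generic PyTorch illustration of that formulation, not the paper's reliability model; `RewardNet`, `preference_loss`, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardNet(nn.Module):
    """Maps a state feature vector to a scalar reward."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def preference_loss(reward_net, seg_a, seg_b, pref):
    """Bradley-Terry loss over a batch of segment pairs.

    seg_a, seg_b: (batch, steps, obs_dim) trajectory segments.
    pref: (batch,) equal to 1.0 if segment A is preferred, else 0.0.
    """
    # Sum predicted per-step rewards over each segment.
    r_a = reward_net(seg_a).sum(dim=1)
    r_b = reward_net(seg_b).sum(dim=1)
    # P(A preferred) = sigmoid(r_a - r_b) under the Bradley-Terry model.
    return F.binary_cross_entropy_with_logits(r_a - r_b, pref)

# Usage with random stand-in data.
net = RewardNet(obs_dim=8)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
seg_a, seg_b = torch.randn(32, 10, 8), torch.randn(32, 10, 8)
pref = torch.randint(0, 2, (32,)).float()
loss = preference_loss(net, seg_a, seg_b, pref)
opt.zero_grad(); loss.backward(); opt.step()
```

The paper's contribution sits on top of such a formulation: rather than trusting `pref` directly, it models the reliability of crowd-provided preference labels and smooths that estimate with the learned reward.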
Label Selection Approach to Learning from Crowds
Supervised learning, especially supervised deep learning, requires large amounts of labeled data. One approach to collecting large amounts of labeled data is to use a crowdsourcing platform, where numerous workers perform the annotation tasks. However, the annotation results often contain label noise, as annotation skill and the ability to complete the task correctly vary across crowd workers. Learning from Crowds is a framework that directly trains models using noisy labeled data from crowd workers. In this study, we propose a novel Learning from Crowds model inspired by SelectiveNet, which was proposed for the selective prediction problem. The proposed method, called the Label Selection Layer, trains a prediction model by automatically determining, via a selector network, whether to use a worker's label for training. A major advantage of the proposed method is that it can be applied to almost all variants of supervised learning problems simply by adding a selector network and changing the objective function of existing models, without explicitly assuming a model of the noise in crowd annotations. Experimental results show that the performance of the proposed method is almost equivalent to or better than that of the Crowd Layer, one of the state-of-the-art methods for Deep Learning from Crowds, except in the regression problem case.
Comment: 15 pages, 1 figure
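The abstract's central mechanism, a selector network that decides whether each crowd label enters the loss, can be pictured along SelectiveNet's lines. The sketch below is a hypothetical minimal instantiation in PyTorch, not the paper's exact Label Selection Layer; the coverage-normalized risk and coverage penalty follow the SelectiveNet recipe, and all names and hyperparameters (`target_coverage`, `lam`) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelSelectionModel(nn.Module):
    """Classifier plus a selector scoring (input, worker) pairs.

    The selector emits a weight in [0, 1] per noisy label; low-weight
    labels contribute little to the training loss.
    """
    def __init__(self, in_dim, n_classes, n_workers, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, n_classes)
        self.worker_emb = nn.Embedding(n_workers, hidden)
        self.selector = nn.Linear(hidden * 2, 1)

    def forward(self, x, worker_ids):
        h = self.backbone(x)
        logits = self.classifier(h)
        s = torch.cat([h, self.worker_emb(worker_ids)], dim=-1)
        select = torch.sigmoid(self.selector(s)).squeeze(-1)
        return logits, select

def selective_loss(logits, select, labels, target_coverage=0.8, lam=32.0):
    # Per-label cross-entropy, weighted by the selector output and
    # normalized by empirical coverage, as in SelectiveNet.
    ce = F.cross_entropy(logits, labels, reduction="none")
    coverage = select.mean()
    risk = (select * ce).sum() / (select.sum() + 1e-8)
    # Penalize coverage falling below the target, so the selector
    # cannot trivially reject every label.
    return risk + lam * F.relu(target_coverage - coverage) ** 2

# Usage with random stand-in data.
model = LabelSelectionModel(in_dim=16, n_classes=4, n_workers=50)
x = torch.randn(32, 16)
workers = torch.randint(0, 50, (32,))
labels = torch.randint(0, 4, (32,))
logits, select = model(x, workers)
loss = selective_loss(logits, select, labels)
```

Note that nothing in this sketch models the noise process itself, which matches the abstract's claim: the approach only adds a selector network and changes the objective function.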
Behavior Estimation from Multi-Source Data for Offline Reinforcement Learning
Offline reinforcement learning (RL) has received rising interest due to its appealing data efficiency. The present study addresses behavior estimation, a task that lays the foundation of many offline RL algorithms. Behavior estimation aims at estimating the policy with which the training data were generated. In particular, this work considers a scenario where the data are collected from multiple sources. In this case, by neglecting data heterogeneity, existing approaches for behavior estimation suffer from behavior misspecification. To overcome this drawback, the present study proposes a latent variable model to infer a set of policies from data, allowing an agent to use as its behavior policy the policy that best describes a particular trajectory. This model provides an agent with a fine-grained characterization of multi-source data and helps it overcome behavior misspecification. This work also proposes a learning algorithm for this model and illustrates its practical usage by extending an existing offline RL algorithm. Lastly, extensive evaluation confirms the existence of behavior misspecification and the efficacy of the proposed model.
Comment: Accepted by AAAI 2023. Fixed errors in Fig. 4 presented in the camera-ready version and Table
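One minimal way to picture such a latent variable model is a mixture of K candidate policies with per-trajectory responsibilities, trained in an EM-like fashion. The sketch below (PyTorch, discrete actions) is an illustrative assumption rather than the paper's algorithm; `PolicyMixture`, `em_style_loss`, and the setting k=3 are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyMixture(nn.Module):
    """K candidate behavior policies over a discrete action space."""
    def __init__(self, obs_dim, n_actions, k=3, hidden=64):
        super().__init__()
        self.policies = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_actions))
            for _ in range(k)
        ])
        self.log_prior = nn.Parameter(torch.zeros(k))  # mixing weights

    def traj_log_liks(self, obs, acts):
        """Log-likelihood of one trajectory under each policy, shape (k,)."""
        return torch.stack([
            -F.cross_entropy(p(obs), acts, reduction="sum")
            for p in self.policies
        ])

def em_style_loss(model, trajectories):
    """Soft-assign each trajectory to a latent policy (E-step), then
    maximize the responsibility-weighted log-likelihood (M-step)."""
    total = 0.0
    for obs, acts in trajectories:
        log_joint = (model.traj_log_liks(obs, acts)
                     + F.log_softmax(model.log_prior, dim=0))
        resp = log_joint.softmax(dim=0).detach()  # responsibilities
        total = total - (resp * log_joint).sum()
    return total / len(trajectories)

# Usage with random stand-in trajectories of (observations, actions).
model = PolicyMixture(obs_dim=8, n_actions=4, k=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
trajs = [(torch.randn(20, 8), torch.randint(0, 4, (20,))) for _ in range(5)]
loss = em_style_loss(model, trajs)
opt.zero_grad(); loss.backward(); opt.step()
```

At deployment, taking the argmax over a trajectory's responsibilities picks the mixture component that best describes it, mirroring the abstract's idea of using the best-fitting policy as the behavior policy for that trajectory.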